The MITTENSS Research Workshop
2026-01-30
We only have 2.5 hours… 😱
Let’s be realistic
Get ideas about how to use LLMs to process data via APIs, and adapt the workflow to your own data and projects
Know where to start and the learning path
Github repo https://github.com/ibertchen/epic_llm_workshop
Posit Cloud https://posit.cloud if you don’t have R installed on your computer
A simplified workflow:
An Application Programming Interface (API) is a set of rules that allows different software applications (e.g., R and OpenAI LLMs) to communicate and share data or functionality seamlessly.
Most mainstream LLM providers (OpenAI, Google, Anthropic, Hugging Face) have API services.
| Feature | ChatGPT (The Product) | OpenAI API (The Engine) |
|---|---|---|
| Interface | A website/app you can chat with. | A “pipe” connecting your code to the AI. |
| Setup | None (Sign in and start typing). | Requires coding (Python, R, etc.). |
| Memory | Remembers you across conversations. | You must “teach” it memory yourself. |
| Pricing | Flat monthly fee (Free, $20, or $200). | Pay-per-use (billed by the token). |
| Control | Limited (OpenAI sets the rules). | High (You control creativity and logic). |
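The "Memory" row above is worth a concrete look: the API itself is stateless, so your code must resend the conversation history on every request. A minimal sketch in Python, assuming the common OpenAI-style chat message format (roles `system`, `user`, `assistant`):

```python
# The model forgets everything between calls, so the client keeps the
# full message history and resends it with each new request.

history = [{"role": "system", "content": "You are a helpful research assistant."}]

def add_turn(history, user_text, assistant_text):
    """Append one user/assistant exchange to the running history."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

add_turn(history, "Summarize my survey item.", "Sure - please paste the item.")
add_turn(history, "Here it is: ...", "This item measures ...")

# The next API request would send all of these messages, which is how
# the model "remembers" earlier turns.
print(len(history))
```

This is what "you must teach it memory yourself" means in practice: memory is just a list your code maintains.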
Regarding data processing
Large-Scale Automation
Batch Processing: You can process thousands of documents, survey responses, or datasets at once. While the Chat app requires you to copy-paste or upload files manually, the API can run 24/7 in the background without human intervention.
Cost Efficiency: As of 2026, the major service providers offer a “Batch Mode” for the API that gives you a 50% discount if you don’t need the results instantly (e.g., overnight processing).
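The 50% batch discount compounds quickly at research scale. A back-of-envelope calculation (the per-token price below is illustrative only; check your provider's current pricing page):

```python
# Hypothetical numbers for illustration - substitute your provider's
# real per-token pricing.
price_per_million_tokens = 2.50     # USD, standard (non-batch) rate
tokens_per_document = 1_500
n_documents = 10_000

total_tokens = tokens_per_document * n_documents
standard_cost = total_tokens / 1_000_000 * price_per_million_tokens
batch_cost = standard_cost * 0.5    # 50% Batch Mode discount

print(f"standard: ${standard_cost:.2f}, batch: ${batch_cost:.2f}")
```

For a 10,000-document overnight job, the discount is the difference between two real line items on a grant budget.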
Precise Control
System Instructions: You can hard-code a specific “persona” or set of rules that the AI must follow for every single query.
Temperature & Top-P: You can adjust the “creativity” level. In research, you might set the Temperature to 0 to ensure the AI gives the same, most likely answer every time, minimizing “hallucinations.”
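A sketch of what these controls look like in a request, assuming an OpenAI-style chat API (parameter names vary slightly by provider; the model name is a placeholder):

```python
# Request parameters for a reproducible coding/classification task.
request = {
    "model": "your-model-name",   # placeholder - use your provider's model ID
    "temperature": 0,             # greedy sampling: same prompt -> stable answer
    "top_p": 1,                   # leave the token distribution untruncated
    "messages": [
        {"role": "system",
         "content": "Answer with one label: positive, negative, or neutral."},
        {"role": "user", "content": "I loved the workshop!"},
    ],
}

# With temperature=0 the model picks the highest-probability token at
# each step, so repeated runs of the same prompt vary far less.
print(request["temperature"])
```

Note that temperature 0 reduces run-to-run variation; it does not by itself guarantee factual accuracy.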
Data Privacy & Security
Training Exclusion: Typically, data sent through the API is not used by the service providers to train their future models. This is crucial for researchers handling sensitive interview transcripts or unpublished datasets.
Institutional Compliance: The API allows for enterprise-grade security that often meets university ethics and IRB requirements.
Structured Output
Data format: You can explicitly control the API to produce data in a specific format (e.g., strings, numbers, dates and times).
File format: You can force the API to give you JSON or CSV data. This means the AI’s response is already formatted as a spreadsheet or database entry, ready for analysis in STATA, SPSS, or Excel.
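The JSON-to-spreadsheet step above takes only a few lines of standard-library Python. A sketch with a mocked API reply (in a real pipeline, `reply` would be the model's structured response):

```python
import csv
import io
import json

# Mocked structured-output reply; a real one comes back from the API.
reply = '[{"id": 1, "sentiment": "positive"}, {"id": 2, "sentiment": "negative"}]'

rows = json.loads(reply)                    # JSON string -> list of dicts

buf = io.StringIO()                         # write CSV to an in-memory buffer
writer = csv.DictWriter(buf, fieldnames=["id", "sentiment"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())                       # ready for Excel, SPSS, or Stata
```

Writing `buf.getvalue()` to a `.csv` file gives you a dataset your statistics package can open directly.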
Important
Follow the MSU AI policies, especially for confidential data.
(I am watching you!)
Broadly speaking, API services can be categorized into real-time APIs and batch APIs.
| Feature | Realtime API | Batch API |
|---|---|---|
| Primary Goal | Ultra-low latency / Interaction | Cost efficiency / Volume |
| Response Time | Immediate (usually <500ms) | Up to 24 hours |
| Cost | Standard | 50% Discount |
| Connection | WebSockets / WebRTC (Streaming) | File Upload (JSONL) |
| Best Modality | Speech-to-Speech, Live Text | Bulk Text, Embeddings, Eval |
| Rate Limits | Standard | Significantly higher limits |
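The "File Upload (JSONL)" row means a Batch API job is just a text file with one JSON request object per line. A sketch of building one, using field names from OpenAI's batch format (`custom_id`, `method`, `url`, `body`); adapt the fields for your provider:

```python
import json

documents = ["First transcript ...", "Second transcript ..."]

lines = []
for i, doc in enumerate(documents):
    lines.append(json.dumps({
        "custom_id": f"doc-{i}",          # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "your-model-name",   # placeholder model ID
            "messages": [
                {"role": "user", "content": f"Code this text: {doc}"},
            ],
        },
    }))

jsonl = "\n".join(lines)   # save this string as a .jsonl file and upload it
print(jsonl.splitlines()[0][:40])
```

Results come back the same way: one JSON object per line, keyed by `custom_id`, so you can merge them back onto your original dataset.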
Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages. (source)
https://www.markdownguide.org/
Markdown is incredibly helpful when working with LLM API services. It acts as a universal bridge between how humans read data and how machines process text.
Structural Clarity: Markdown helps the model understand the hierarchy and importance of your prompt. This leads to more organized and relevant responses.
Separation of Concerns: Markdown allows you to clearly separate different parts of a prompt. This prevents the model from getting confused between your commands and the data it needs to process.
Precise Code Handling: Markdown provides a standard way to denote code blocks, allowing the model to explicitly read your code as instructions or examples and generate outcomes in appropriate formats.
❗Even most of the content generated by AIs is formatted in Markdown.
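The "separation of concerns" point can be shown in a few lines: use Markdown headings to wall off instructions from the data being processed. A sketch (the heading names are arbitrary, not a required convention):

```python
# Markdown headings keep the model from confusing commands with data.
instructions = "Classify the sentiment of the text as positive, negative, or neutral."
data = "The new library hours are wonderful!"

prompt = (
    "# Task\n"
    f"{instructions}\n"
    "\n"
    "# Text to classify\n"
    f"{data}\n"
)

print(prompt)
```

When `data` is a survey response that itself contains imperative sentences, this structure makes it much less likely the model will follow those sentences as instructions.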
https://www.json.org/json-en.html
JSON (JavaScript Object Notation) has become the standard for structured data exchange on the web. The data format provides a unique balance between machine efficiency and human readability.
The major advantages of JSON:
- Lightweight: data is stored as `"name": "value"` pairs, resulting in smaller file sizes and faster data transmission.
- Flexible structure: Objects `{}` and Arrays `[]` can be nested to represent complex records.

```json
[
  {
    "name": "Pikachu",
    "category": "Mouse Pokémon",
    "is_legendary": false,
    "skill": ["Static", "Lightning Rod"],
    "primary_type": "Electric",
    "z_move_eligible": true
  },
  {
    "name": "Eevee",
    "category": "Evolution Pokémon",
    "is_legendary": false,
    "skill": ["Run Away", "Adaptability", "Anticipation"],
    "primary_type": "Normal",
    "evolution_count": 8
  }
]
```
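One reason JSON is so convenient on the code side: each JSON value maps directly onto a native type in your language. A sketch using part of the Pikachu record above:

```python
import json

record = """{
  "name": "Pikachu",
  "is_legendary": false,
  "skill": ["Static", "Lightning Rod"]
}"""

pikachu = json.loads(record)

print(type(pikachu["name"]).__name__)           # JSON string  -> str
print(type(pikachu["is_legendary"]).__name__)   # JSON boolean -> bool
print(type(pikachu["skill"]).__name__)          # JSON array   -> list
```

No custom parsing code is needed; one call turns the API's reply into ordinary data structures you can filter, count, and export.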
How many times have you gotten the same outcome from asking an AI the same question?
Structured output generation reduces randomness and variation in LLM responses.
OpenAI’s Structured model output
Google Gemini’s Structured outputs
We can ask LLMs with structured output functionality to generate content that adheres to a provided JSON Schema.
JSON Schema is a declarative language for defining the structure and constraints of JSON data.
Structured Outputs is a feature that ensures the model will always generate responses that adhere to your supplied JSON Schema (source)
```json
{
  "title": "PokemonSchema",
  "required": [
    "name",
    "category",
    "is_legendary",
    "skill",
    "primary_type"
  ],
  "properties": {
    "name": { "type": "string" },
    "category": { "type": "string" },
    "is_legendary": { "type": "boolean" },
    "skill": {
      "type": "array",
      "items": { "type": "string" }
    },
    "primary_type": { "type": "string" },
    "z_move_eligible": { "type": "boolean" },
    "evolution_count": { "type": "integer" }
  }
}
```
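Even with Structured Outputs, it is good practice to verify records on your side before analysis. A tiny stand-in validator for the schema above, using only the standard library (the third-party `jsonschema` package does this properly and supports the full specification):

```python
# Hand-rolled check of the schema's "required" list and a few type
# constraints - a sketch, not a full JSON Schema validator.
required = ["name", "category", "is_legendary", "skill", "primary_type"]
type_map = {
    "name": str,
    "category": str,
    "is_legendary": bool,
    "skill": list,
    "primary_type": str,
}

def check(record):
    """Return True if all required keys exist with the expected types."""
    missing = [k for k in required if k not in record]
    wrong = [k for k, t in type_map.items()
             if k in record and not isinstance(record[k], t)]
    return not missing and not wrong

pikachu = {"name": "Pikachu", "category": "Mouse Pokémon",
           "is_legendary": False, "skill": ["Static"],
           "primary_type": "Electric"}

print(check(pikachu))           # valid record
print(check({"name": "Mew"}))   # required fields missing
```

In a batch pipeline, records failing this check can be logged and re-queried instead of silently corrupting the dataset.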
FINALLY!
Always go for system prompt customization first (a.k.a. prompt engineering)
Retrieval-Augmented Generation (RAG)
The last resort is fine-tuning
Start with the best models (OpenAI, Google, Claude)
Do not start with non-public information or data